Summary Report


The goal of this project was a feasibility study of a particular architecture for a digital 
signal processing machine operating in real time, which computes in pipeline fashion the 
Fast Fourier Transform of a time-domain sampled complex digital data 
stream. The architecture is described in the enclosed paper “FFT Computation 
With Systolic Arrays - A New Architecture” (IEEE Trans. on Circuits and Systems 
II: Analog and Digital Signal Processing, 41, p.278, 1994), and makes use of simple identical 
processors (called Inner Product Processors) in a linear organization called a systolic array. 
By definition systolic arrays consist of organizations in one, two or more dimensions of such 
processors, where the only communication with the outside world is at the edge of the 
array, and each processor exchanges data exclusively with its closest 
neighbours. The system clock is common: it is applied system-wide to all the processors in 
synchronism. 
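
As an illustration of the definition above only, and not of the FFT architecture of the enclosed paper, the following Python sketch models a generic linear systolic array that forms a matrix-vector product. Data enter at the edge cell, move one cell to the right on every tick of the common clock, and each cell performs one multiply-accumulate per tick. The function name `systolic_matvec` and the choice of keeping the accumulators stationary in the cells are assumptions made for this example.

```python
def systolic_matvec(w, x):
    """Model a linear systolic array computing y[k] = sum_j w[k][j] * x[j].

    Cell k holds an accumulator for y[k]. The x values enter at cell 0
    and are passed only to the nearest neighbour on each clock tick.
    """
    n_cells = len(w)
    regs = [None] * n_cells          # x value currently latched in each cell
    acc = [0j] * n_cells             # one accumulator per cell
    for t in range(len(x) + n_cells - 1):
        # Common clock: every cell latches its left neighbour's value,
        # and the edge cell latches the next input sample (if any).
        regs = [x[t] if t < len(x) else None] + regs[:-1]
        for k in range(n_cells):
            j = t - k                # index of the sample now held by cell k
            if regs[k] is not None:
                acc[k] += regs[k] * w[k][j]   # C_out = C_in + A * B
    return acc

# Small usage example with a 2 x 2 coefficient matrix.
y = systolic_matvec([[1, 2], [3, 4]], [5, 6])   # y == [17, 39] as complex values
```

The only input port is cell 0 and each cell reads only from its left neighbour, which is the property the definition requires.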

Many processing organizations using systolic arrays have been devised (see enclosed 
paper for a partial list and analysis of them). However, many of them are not economical, 
or do not work in a pipeline fashion: they require large storage memories in between 
processing stages, including at the input of the first processing stage. In many cases the 
access order of the stored elements is different when writing and when reading, hence they 
require two memory sets in a ping-pong arrangement for faultless access. 

The advantages of our proposed organization are multiple: it is very economical in 
hardware and requires no stream switching or data rearranging during the processing. It 
operates continuously without need of interrupting the incoming data stream even when 
new blocks of data are applied, and transformed complex sequences exit from the systolic 
array at the same rate as the sampled data comes in. The storage elements work in a 
sequential fashion and could even be replaced with shift registers or with FIFOs; no 
data resequencing is necessary anywhere. 
An interesting application of this processing system arose when this 
architecture was considered for the correction of optical distortion due to the turbulence of 
the Earth’s atmosphere in the viewing volume of a telescope. This operation would be done 
by correcting for the distortion with a compensating system. Ultimately the compensating 
system could be a “rubber” mirror, an optical mirror whose optical figure is distorted 
by a set of electromechanical actuators located on the mirror back. Another compensating 
system could be simply the selection of observing time intervals for integration when the 
atmosphere is minimally perturbed. Another possibility is to invert the process and use 
atmospheric turbulence to generate speckle interferometry, a way of obtaining high angular 
resolution observations. It is in this configuration that this project was worked out (G. 
Chin et al, 1988). 

Data obtained from a 64 by 64 point Charge Coupled Device is applied to a 2D FFT 
processor, which can be based on the FFT processor described above operating on a 64 
point data block. A detailed description of the processing sequence and scheme can be 
found in the Chin et al paper. 
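
One standard way to build a 2D transform from a 1D transform is the row-column method: transform every row, then every column of the result. The Python sketch below shows that idea under stated assumptions. The helper names `dft_1d` and `dft_2d` are hypothetical, a direct DFT stands in for the 64 point FFT processor, and the processing sequence actually used in the Chin et al paper is not reproduced here.

```python
import cmath

def dft_1d(x):
    """Direct 1D DFT, used here as a stand-in for the 1D FFT processor."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def dft_2d(image):
    """2D DFT by the row-column method."""
    rows = [dft_1d(row) for row in image]               # transform each row
    cols = [dft_1d(list(col)) for col in zip(*rows)]    # then each column
    return [list(r) for r in zip(*cols)]                # transpose back

# Tiny 4 x 4 example; a 64 x 64 frame would be handled the same way,
# with 64 row transforms followed by 64 column transforms.
frame = [[1.0 if (r, c) == (0, 0) else 0.0 for c in range(4)] for r in range(4)]
spectrum = dft_2d(frame)   # an impulse at the origin gives a flat spectrum
```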

At the beginning of this project the described algorithm was a theoretical conception, 
and although it seemed a fully realistic scheme no effort had been made to test its feasibility. 
This project comprised two phases: a simulation of the whole system on a computer where 
all bits were represented, and a study of the feasibility of the implementation of the Inner 
Product Processor. Both phases were pursued in parallel. 

Simulation. Simulation of the architecture was carried out in a computer program 
where the individual bits of the digital words of the data streams of the systolic array 
were represented separately. Assembled together into digital words, they were used in the 
computation of the complex arithmetic of the Inner Product Processor, $C_{out} = C_{in} + A \cdot B$. 
Storage delays were represented by arrays of data words. Simulation was carried out in 
fixed point computation with 22 bit accuracy; an FFT machine of 1024 points was assembled 
in software. The input signal was a sine wave with a variable amount of Gaussian noise 
added. This signal was processed with this architecture simulated in 
a C program. An accurate result in the frequency domain with the expected signal-to-noise 
ratio was obtained. In a second, more complex, simulation a wide 
band impulse in the time domain was dispersed according to the inverse frequency 
squared law of cold, tenuous plasma (interstellar plasma). This signal was applied to a 1024 
point FFT simulated architecture; the frequency domain signal went through an all-pass digital 
filter with a frequency characteristic opposite to that of the interstellar medium cold 
plasma filter; and a second 1024 point inverse FFT simulated architecture machine brought 
the signal back into the time domain. Comparison of the input and output signals showed 
correct operation of the simulated architecture. This work was done in collaboration with 
Sivakumar Makineni. 
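
The second test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original C simulation. It uses an ordinary floating point radix-2 FFT rather than the bit-level systolic model, and the dispersion constant `DISPERSION`, the normalized band from 1 to 2, and the helper name `plasma_filter` are all hypothetical. The filter phase is proportional to 1/f, which corresponds to a group delay proportional to 1/f squared.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (length a power of two), without normalization."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def ifft(x):
    """Inverse FFT with the 1/N normalization applied."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]

N = 1024
DISPERSION = 50.0          # hypothetical dispersion constant

def plasma_filter(k):
    """Unit-magnitude (all-pass) response of the assumed medium at bin k."""
    f = 1.0 + k / N        # assumed normalized band: f runs from 1 to 2
    # Phase proportional to 1/f delays frequency f by about DISPERSION / f**2 samples.
    return cmath.exp(2j * cmath.pi * DISPERSION / f)

impulse = [0j] * N
impulse[100] = 1 + 0j

# Generate the dispersed test signal.
dispersed = ifft([v * plasma_filter(k) for k, v in enumerate(fft(impulse))])

# Chain described in the report: FFT, opposite-phase all-pass filter, inverse FFT.
spectrum = fft(dispersed)
corrected = [v * plasma_filter(k).conjugate() for k, v in enumerate(spectrum)]
recovered = ifft(corrected)

peak_dispersed = max(abs(v) for v in dispersed)
round_trip_error = max(abs(r - s) for r, s in zip(recovered, impulse))
```

Because the filter has unit magnitude at every bin, multiplying by its complex conjugate cancels the dispersion exactly, so `recovered` should match `impulse` to within rounding error while `dispersed` is smeared over many samples.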

Inner Product Processor. Two integrated circuit implementations of 
the Inner Product Processor were laid out in 2 micron CMOS, one in 
floating point arithmetic (in collaboration with Peter DelVecchio and Wei Chen) and the 
other in fixed point arithmetic (in collaboration with Emad Afifi). Both used the maximum 
silicon area standard MOSIS chips may have, 7.9mm by 9.2mm. Both use standard p-well 
VLSI design. The complex arithmetic operations carried out by both implementations are: 


$$C_{out,real} = C_{in,real} + A_{real} \cdot B_{real} - A_{imag} \cdot B_{imag}$$

$$C_{out,imag} = C_{in,imag} + A_{real} \cdot B_{imag} + A_{imag} \cdot B_{real}$$


where A is the input data, B the FFT coefficients (twiddle factors), and C the processed 
data. 
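
The two real-valued equations above amount to one complex multiply-accumulate. A minimal Python sketch follows, with the hypothetical function name `inner_product_step` and complex values passed as (real, imag) pairs. It shows that the operation needs four real multiplications and four real additions or subtractions.

```python
def inner_product_step(c_in, a, b):
    """Return C_out = C_in + A * B, with each value given as a (real, imag) pair."""
    c_re, c_im = c_in
    a_re, a_im = a
    b_re, b_im = b
    out_re = c_re + a_re * b_re - a_im * b_im   # real part
    out_im = c_im + a_re * b_imag_free(b_im) if False else c_im + a_re * b_im + a_im * b_re
    return out_re, out_im

# Usage: (1 + 2j) + (3 - 1j) * (0.5 + 0.25j)
c = inner_product_step((1.0, 2.0), (3.0, -1.0), (0.5, 0.25))   # (2.75, 2.25)
```

The result agrees with Python's built-in complex arithmetic for the same operands.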

Floating Point Inner Product Processor. Because the standard format of the digital 
words in the whole processing system was IEEE floating point format (IEEE Standard, 
1985), the main design of the Inner Product Processor was to be in this format. This 
format is not very convenient for digital adders and multipliers, so format conversion was 
done inside the integrated circuit for arithmetic convenience, and then at the exit of the 
system the format was converted back into IEEE floating point to conform to the machine 
standard. The arithmetic operations are computed in this architecture with one 
hardware 24 bit by 24 bit multiplier and two hardware adders, one of 49 bits and the other 
of 8 bits. In addition to the actual additions in the products, the exponents of the floating 
point numbers must be added. A 1 µsec operation cycle was the goal for this design; to meet 
this goal four computational cycles of 250nsec each have been implemented. This requires 
a 4MHz clock. Since 64 point transforms were required, the values of the coefficients B of 
the FFT algorithm were computed on chip by a special circuit that took into account the 
location of this particular Inner Product Processor in the architecture. The location was 
encoded by hardwiring of inputs to the integrated circuit. To accommodate as many bits in 
parallel as possible in the inputs and the outputs of the integrated circuit, a ceramic body 
with the maximum set of 132 pins was selected. 
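
The multiplier and adder widths quoted above are consistent with IEEE single precision, whose significand has 24 bits once the hidden leading 1 is restored and whose exponent field has 8 bits. The Python sketch below illustrates that decomposition under the assumption of single precision operands. The chip's actual internal format is not described in this report, the function names are hypothetical, and zeros, denormals, infinities and NaNs are deliberately not handled.

```python
import struct

def unpack_float32(x):
    """Split a normal IEEE 754 single into (sign, unbiased exponent, 24 bit significand)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF          # 8 bit biased exponent field
    fraction = bits & 0x7FFFFF              # 23 stored fraction bits
    significand = fraction | 0x800000       # restore the hidden leading 1
    return sign, exponent - 127, significand

def multiply_via_integers(x, y):
    """Multiply two normal singles using an integer multiply and an exponent add."""
    sx, ex, mx = unpack_float32(x)
    sy, ey, my = unpack_float32(y)
    sign = sx ^ sy
    exponent = ex + ey                      # exponent addition
    product = mx * my                       # 24 x 24 bit product, at most 48 bits
    # Each significand represents m / 2**23, so the product carries 2**-46.
    value = product * 2.0 ** (exponent - 46)
    return (-value if sign else value), product

result, raw_product = multiply_via_integers(1.5, -2.25)   # result is -3.375
```

A 24 by 24 bit product occupies up to 48 bits, so an adder one bit wider leaves room for a carry when products are summed.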

Fixed Point Inner Product Processor. A fixed point version of the Inner Product 
Processor was also designed and built. This chip has a 22-bit accuracy, sufficient for 
FFTs of up to $10^6$ points. A somewhat different chip architecture was chosen: two units 
consisting of a multiplier and an adder each operate in parallel. The B coefficients are 
applied externally to this chip, hence there is no internal generator as there is in the floating 
point chip. For a goal of 1MHz operation the multiplexing and computation require 5 
computational cycles of 200nsec each, hence a 5MHz clock is required. A small shift 
register was implemented at the output of the processor, at $C_{out}$. Its purpose is to serve 
as interstage delay when necessary. The design was implemented with encapsulation in a 
ceramic package with the maximum number of available pins, 132. 
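
A fixed point multiply-accumulate of this kind can be sketched with Python integers. The sketch assumes a 22 bit two's complement word with one sign bit and 21 fractional bits, and truncation of each product. The report does not state the chip's word layout, rounding, or overflow behaviour, so these choices and all names below are hypothetical.

```python
WORD_BITS = 22                 # word length quoted in the report
FRAC_BITS = WORD_BITS - 1      # assumption: one sign bit, 21 fractional bits
SCALE = 1 << FRAC_BITS

def to_fixed(x):
    """Quantize a real number in [-1, 1) to a signed fixed point integer."""
    return int(round(x * SCALE))

def to_float(q):
    """Convert a fixed point integer back to a real number."""
    return q / SCALE

def fixed_mul(p, q):
    """Multiply two fixed point words and truncate back to FRAC_BITS."""
    return (p * q) >> FRAC_BITS

def fixed_mac(c, a, b):
    """C_out = C_in + A * B on (real, imag) pairs of fixed point integers."""
    c_re, c_im = c
    a_re, a_im = a
    b_re, b_im = b
    out_re = c_re + fixed_mul(a_re, b_re) - fixed_mul(a_im, b_im)
    out_im = c_im + fixed_mul(a_re, b_im) + fixed_mul(a_im, b_re)
    return out_re, out_im      # overflow beyond 22 bits is not modelled here

def quantize(z):
    """Quantize a Python complex number to a (real, imag) fixed point pair."""
    return to_fixed(z.real), to_fixed(z.imag)

c_in, a, b = 0.1 - 0.2j, 0.5 + 0.25j, 0.6 - 0.3j
out_re, out_im = fixed_mac(quantize(c_in), quantize(a), quantize(b))
result = complex(to_float(out_re), to_float(out_im))
exact = c_in + a * b
error_lsb = abs(result - exact) * SCALE    # error in units of the last bit
```

Comparing `result` with `exact` gives a feel for the quantization error of a single step, which here stays within a few least significant bits.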

Integrated Circuit Simulation and Testing. Before sending the designs for 
fabrication both were simulated in software. On the floating point version the standard 
software simulation (RSIM) was run; it consists of applying a set of random input vectors 
to a circuit equivalent extracted from the layout itself, and comparing the output 
these random vectors produced with the expected output computed on the basis of 
what the circuit itself should do. The formal verification procedure Nuprl was also run 
on part of the design: the Mantissa Adjustor and Exponent Calculator. Another round of 
simulation was run after final completion of the floating point version, and several errors 
were corrected. A data rate of 1MHz and a clock rate of 4MHz were sustained correctly 
in the simulation. Simulation of the fixed point version was also carried out with RSIM, 
and the results verified in the same way. In the simulation the clock rate was raised from 
5MHz to 10MHz, and the data rate from 1MHz to 2MHz without errors, showing that 
the design had ample operational margin. 
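
The random vector method described above can be illustrated with a short Python sketch. It is not RSIM, which simulates a circuit extracted from the layout. Here a bit-level shift-and-add model of an unsigned multiplier plays the role of the circuit, and Python's integer multiplication plays the role of the expected output. The names `shift_add_multiply` and `run_random_vectors` are hypothetical.

```python
import random

def shift_add_multiply(a, b, width=24):
    """Bit-level model of an unsigned width x width multiplier."""
    acc = 0
    for i in range(width):
        if (b >> i) & 1:          # examine one bit of b at a time
            acc += a << i         # add the correspondingly shifted copy of a
    return acc

def run_random_vectors(model, reference, count, width=24, seed=0):
    """Apply random input vectors and collect those where the outputs differ."""
    rng = random.Random(seed)     # fixed seed so that a failure can be replayed
    failures = []
    for _ in range(count):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        if model(a, b) != reference(a, b):
            failures.append((a, b))
    return failures

failures = run_random_vectors(shift_add_multiply, lambda a, b: a * b, count=1000)
```

An empty `failures` list means the model agreed with the reference on every vector applied. As with any random testing, this raises confidence without proving correctness, which is presumably why formal verification was also applied to part of the design.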

After fabrication the integrated circuits had to be tested. The problem arose of finding 
an integrated circuit tester capable of applying 132 signals simultaneously to a Device 
Under Test. After many aborted attempts to operate a Tektronix LT1000 tester, or to obtain 
access to other testers with the necessary capability, a solution was found in a commercial 
company, Testware, of Hudson, MA, where General Radio type 2286 integrated circuit 
and board testers were available. The maximum clock rate was 4MHz, but the number 
of available pins was substantially larger than 132. Both sets of integrated circuits were 
tested. The floating point version was found to be non-operating, with different defects 
precluding operation on all 30-plus copies of the integrated circuit. The fixed point version 
operated successfully, and the test results at up to 4MHz clock predicted that 
it would operate correctly up to a 10MHz clock rate and 2MHz data rate. This part of the 
work was carried out with M. Lopresti. 

Conclusions 

1) Through computer simulation the new architecture to compute the FFT with Systolic 
Arrays was proved to be viable, and computed the FFT correctly and with the 
predicted particulars of operation. 

2) Integrated circuits to compute the operations expected of the vital node of the 
Systolic Architecture were proven feasible, and even with a 2 micron VLSI technology can 
execute the required operations in the required time. Actual construction of the integrated 
circuits was successful in one variant (fixed point) and unsuccessful in the other (floating 
point). 